Chapter 6 Lasso

Lasso (Tibshirani 1996) is among the most popular machine learning models. Unlike ridge regression, it adds an \(\ell_1\) penalty on the fitted parameters:

\[ \begin{align} \widehat{\boldsymbol \beta}^\text{lasso} =& \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} (\mathbf{y}- \mathbf{X}\boldsymbol \beta)^\text{T}(\mathbf{y}- \mathbf{X}\boldsymbol \beta) + n \lambda \lVert\boldsymbol \beta\rVert_1\\ =& \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \frac{1}{n} \sum_{i=1}^n (y_i - x_i^\text{T}\boldsymbol \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|, \end{align} \] The main advantage of adding such a penalty is that small \(\widehat{\beta}_j\) values can be shrunk exactly to zero. This may prevent over-fitting and also improve interpretability, especially when the number of variables is large. We will analyze the Lasso starting with the single-variable case, and then discuss the application of the coordinate descent algorithm to obtain the solution in general.
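As a concrete reference point, the penalized objective above can be written directly in code. The following is a minimal NumPy sketch with simulated data (the data, \(\lambda\) value, and variable names are illustrative assumptions, not from the text):

```python
import numpy as np

def lasso_objective(beta, X, y, lam):
    """(1/n) * ||y - X beta||^2 + lam * ||beta||_1, matching the display above."""
    n = X.shape[0]
    resid = y - X @ beta
    return resid @ resid / n + lam * np.abs(beta).sum()

# Simulated data: a sparse truth with several irrelevant variables
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
y = X @ beta_true + rng.standard_normal(100)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(lasso_objective(beta_ols, X, y, lam=1.0))
```

With \(\lambda = 0\) this is just the averaged residual sum of squares, so the OLS fit attains the smallest possible value; the \(\ell_1\) term then trades fit for sparsity.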

6.1 One-Variable Lasso and Shrinkage

To illustrate how Lasso shrinks a parameter estimate to zero, let’s consider an orthogonal design matrix case, i.e., \(\mathbf{X}^\text{T}\mathbf{X}= n \mathbf{I}\), which reduces the problem to a collection of one-variable problems. Note that the intercept term is not essential because we can always pre-center the observed \(x_i\) and \(y_i\) values and recover the intercept afterwards. Our objective function is

\[\frac{1}{n}\lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert^2 + \lambda \lVert\boldsymbol \beta\rVert_1\] We are going to relate the solution to the OLS solution, which exists in this case because \(\mathbf{X}^\text{T}\mathbf{X}\) is invertible. Hence, we have

\[\begin{align} &\frac{1}{n}\lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert^2 + \lambda \lVert\boldsymbol \beta\rVert_1\\ =&\frac{1}{n}\lVert \mathbf{y}- \color{OrangeRed}{\mathbf{X}\widehat{\boldsymbol \beta}^\text{ols} + \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols}} - \mathbf{X}\boldsymbol \beta\rVert^2 + \lambda \lVert\boldsymbol \beta\rVert_1\\ =&\frac{1}{n}\lVert \mathbf{y}- \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols} \rVert^2 + \frac{1}{n} \lVert \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols} - \mathbf{X}\boldsymbol \beta\rVert^2 + \lambda \lVert\boldsymbol \beta\rVert_1 \end{align}\]

The cross-term is zero because the OLS residual vector \(\mathbf{r} = \mathbf{y} - \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols}\) is orthogonal to the columns of \(\mathbf{X}\):

\[\begin{align} &2(\mathbf{y}- \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols})^\text{T}(\mathbf{X}\widehat{\boldsymbol \beta}^\text{ols} - \mathbf{X}\boldsymbol \beta)\\ =& 2\mathbf{r}^\text{T}\mathbf{X}(\widehat{\boldsymbol \beta}^\text{ols} - \boldsymbol \beta)\\ =& 0 \end{align}\]
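This orthogonality is the usual normal-equation fact \(\mathbf{X}^\text{T}\mathbf{r} = \mathbf{0}\), and it can be checked numerically. The short NumPy sketch below uses simulated data (names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta_ols
# Every column of X is orthogonal to the OLS residual vector,
# so the cross-term 2 r^T X (beta_ols - beta) vanishes for any beta
print(X.T @ r)
```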

Then we just need to optimize the part that involves \(\boldsymbol \beta\):

\[\begin{align} &\underset{\boldsymbol \beta}{\mathop{\mathrm{arg\,min}}} \frac{1}{n}\lVert \mathbf{y}- \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols} \rVert^2 + \frac{1}{n} \lVert \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols} - \mathbf{X}\boldsymbol \beta\rVert^2 + \lambda \lVert\boldsymbol \beta\rVert_1\\ =&\underset{\boldsymbol \beta}{\mathop{\mathrm{arg\,min}}} \frac{1}{n} \lVert \mathbf{X}\widehat{\boldsymbol \beta}^\text{ols} - \mathbf{X}\boldsymbol \beta\rVert^2 + \lambda \lVert\boldsymbol \beta\rVert_1\\ =&\underset{\boldsymbol \beta}{\mathop{\mathrm{arg\,min}}} \frac{1}{n} (\widehat{\boldsymbol \beta}^\text{ols} - \boldsymbol \beta)^\text{T}\mathbf{X}^\text{T}\mathbf{X}(\widehat{\boldsymbol \beta}^\text{ols} - \boldsymbol \beta) + \lambda \lVert\boldsymbol \beta\rVert_1\\ =&\underset{\boldsymbol \beta}{\mathop{\mathrm{arg\,min}}} \frac{1}{n} (\widehat{\boldsymbol \beta}^\text{ols} - \boldsymbol \beta)^\text{T}n \mathbf{I}(\widehat{\boldsymbol \beta}^\text{ols} - \boldsymbol \beta) + \lambda \lVert\boldsymbol \beta\rVert_1\\ =&\underset{\boldsymbol \beta}{\mathop{\mathrm{arg\,min}}} \sum_{j = 1}^p (\widehat{\beta}^\text{ols}_j - \beta_j )^2 + \lambda \sum_{j=1}^p |\beta_j| \end{align}\]
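The decomposition above can be sanity-checked numerically. The NumPy sketch below (simulated data; all names are illustrative) builds a design with \(\mathbf{X}^\text{T}\mathbf{X} = n\mathbf{I}\) by scaling an orthonormal basis, then confirms that the penalized objective at an arbitrary \(\boldsymbol\beta\) equals the OLS residual term plus the separable quadratic and \(\ell_1\) terms:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3
# Scale an orthonormal basis so that X^T X = n I exactly
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
X = np.sqrt(n) * Q
y = rng.standard_normal(n)

beta_ols = X.T @ y / n          # (X^T X)^{-1} X^T y reduces to X^T y / n here
lam = 0.7
beta = rng.standard_normal(p)   # an arbitrary candidate beta

# Full penalized objective ...
lhs = ((y - X @ beta) ** 2).sum() / n + lam * np.abs(beta).sum()
# ... equals residual term + sum_j (beta_ols_j - beta_j)^2 + lam * sum_j |beta_j|
rhs = (((y - X @ beta_ols) ** 2).sum() / n
       + ((beta_ols - beta) ** 2).sum()
       + lam * np.abs(beta).sum())
print(lhs, rhs)
```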

This is a separable problem, meaning that we can solve for each \(\beta_j\) independently since the coordinates do not interfere with each other. Writing \(a = \widehat{\beta}_j^\text{ols}\), the univariate problem is

\[\underset{\beta}{\mathop{\mathrm{arg\,min}}} \,\, (\beta - a)^2 + \lambda |\beta|\] We learned that to solve for an optimizer, we can set the gradient to zero. However, this function is not differentiable everywhere. Still, we can separate the problem into two cases: \(\beta > 0\) and \(\beta < 0\). For the positive side, we have

\[\begin{align} 0 =& \frac{\partial}{\partial \beta} \,\, (\beta - a)^2 + \lambda |\beta| = 2 (\beta - a) + \lambda \\ \Longrightarrow \quad \beta =&\, a - \lambda/2 \end{align}\]

However, this solution remains positive only when \(a > \lambda/2\). The negative side is similar. Whenever \(a\) falls between \(-\lambda/2\) and \(\lambda/2\), the estimate is shrunk to zero. Overall, for our previous univariate optimization problem, the solution is

\[\begin{align} \hat\beta_j^\text{lasso} &= \begin{cases} \hat\beta_j^\text{ols} - \lambda/2 & \text{if} \quad \hat\beta_j^\text{ols} > \lambda/2 \\ 0 & \text{if} \quad |\hat\beta_j^\text{ols}| < \lambda/2 \\ \hat\beta_j^\text{ols} + \lambda/2 & \text{if} \quad \hat\beta_j^\text{ols} < -\lambda/2 \\ \end{cases}\\ &= \text{sign}(\hat\beta_j^\text{ols}) \left(|\hat\beta_j^\text{ols}| - \lambda/2 \right)_+ \end{align}\]
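This soft-thresholding operator is easy to implement, and its closed form can be verified against a brute-force grid search over \(\beta\). The NumPy sketch below (the \(\lambda\) value and test points are illustrative) does exactly that:

```python
import numpy as np

def soft_threshold(a, lam):
    """Closed-form minimizer of (beta - a)^2 + lam * |beta|: sign(a) * (|a| - lam/2)_+."""
    return np.sign(a) * np.maximum(np.abs(a) - lam / 2, 0.0)

# Brute-force check: minimize the univariate objective over a fine grid
lam = 1.0
grid = np.linspace(-3, 3, 600001)
for a in [-2.0, -0.3, 0.0, 0.4, 1.7]:
    obj = (grid - a) ** 2 + lam * np.abs(grid)
    best = grid[np.argmin(obj)]
    assert abs(best - soft_threshold(a, lam)) < 1e-4
```

Note how \(a = 0.4\) and \(a = -0.3\) both land inside \((-\lambda/2, \lambda/2)\) and are mapped to exactly zero, while the larger values are merely shifted toward zero by \(\lambda/2\).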

This implies that when \(\lambda\) is large enough, the estimated Lasso parameter will be shrunk all the way to zero. The following animated figure demonstrates how adding an \(\ell_1\) penalty can change the optimizer.

Reference

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.